Online Appendix to Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph

Authors

  • DMITRI V. KALASHNIKOV
  • SHARAD MEHROTRA
Abstract

Notation. We will compute probabilities of certain events. The notation P(A) refers to the probability that event A occurs. We use ∃E to denote the event “E exists” for an edge E. Similarly, we use ∄E for the event “E does not exist”. Therefore, P(∃E) refers to the probability that E exists. We will consider situations where the algorithm computes the probability of following (or ‘going along’) a specific edge E, usually in the context of a specific path. This probability is denoted as P(E ). We use the notation dep(e1, e2) as follows: dep(e1, e2) = true if and only if events e1 and e2 are dependent. The notation P denotes the path currently being considered. Table I summarizes the notation.

The challenge. Figure 1 illustrates an interesting property of graphs with probabilistic edges: each such graph maps onto a family of regular graphs. Figure 1(a) shows a probabilistic graph in which three edges are labeled with a probability of 0.5. This probabilistic graph maps onto 2^3 regular graphs. For instance, if we assume that none of the three edges is present (the probability of which is 0.5^3), then the graph in Figure 1(a) is instantiated to the regular graph in Figure 1(b). Figures 1(c) and 1(d) show two other possible instantiations, each occurring with the same probability of 0.5^3. The challenge in designing algorithms that compute any measure on such probabilistic graphs, including the connection strength measure, comes from the following observation. If a probabilistic graph has n independent edges that are labeled with non-1 probabilities, then this graph maps onto an exponential number (i.e., 2^n) of regular graphs, where the probability of each instantiation is determined by the probability that the corresponding combination of edges exists. Algorithms that work with probabilistic graphs should be able to account for the fact that some of the edges exist only with certain probabilities. If such an algorithm computes a certain measure on a probabilistic graph, it should avoid computing it naively by com-
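
To make the exponential blow-up concrete, the following is a minimal Python sketch of the brute-force approach: it enumerates every instantiation of a probabilistic graph with n independent edges and sums the probabilities of those in which two nodes are connected. It is an illustration only, not the authors' algorithm and not their connection strength measure. The function names (`enumerate_worlds`, `connected`, `connection_probability`) and the three-edge toy graph are assumptions made for this example; the toy graph is not the graph of Figure 1.

```python
from itertools import product


def enumerate_worlds(edge_probs):
    """Yield (present_edges, probability) for each of the 2^n instantiations.

    edge_probs maps an edge (u, v) to its existence probability.
    Assumption: all probabilistic edges are independent.
    """
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = []
        for edge, exists in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if exists else 1.0 - p
            if exists:
                present.append(edge)
        yield present, prob


def connected(edges, source, target):
    """Return True if target is reachable from source (undirected DFS)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def connection_probability(certain_edges, edge_probs, source, target):
    """Sum the probabilities of all instantiations where source reaches target."""
    total = 0.0
    for present, prob in enumerate_worlds(edge_probs):
        if connected(list(certain_edges) + present, source, target):
            total += prob
    return total


# Hypothetical toy graph: three independent edges, each with probability 0.5.
edge_probs = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
worlds = list(enumerate_worlds(edge_probs))
print(len(worlds))  # number of instantiations
print(connection_probability([], edge_probs, "a", "c"))
```

The loop in `enumerate_worlds` runs 2^n times, so this sketch is practical only for a handful of probabilistic edges, which is exactly the cost the paragraph above describes.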


Related resources

A Network Game with Attacker and Protector Entities

Consider an information network with harmful procedures called attackers (e.g., viruses); each attacker uses a probability distribution to choose a node of the network to damage. Opponent to the attackers is the system protector scanning and cleaning from attackers some part of the network (e.g., an edge or a path), which it chooses independently using another probability distribution. Each att...


An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...


The Relationship Between Critical Thinking and Online Information Seeking Behavior in Postgraduate Students of Ahvaz Jundishapur University of Medical Sciences

Introduction: Web is a source of information for students. Online information seeking behavior is related to several factors. One of these factors is the skill of critical thinking and information analysis on the web. The present study was conducted for explanting online information behavior and relation with critical thinking in postgraduate student. Methods: The present research is descripti...


Mining How-to Task Knowledge From Online Communities

Nowadays, knowledge graphs have become a fundamental asset for search engines which need background commonsense knowledge for natural interactions. A fair amount of user queries seek information on problem-solving tasks such as painting a wall or repairing a bicycle. While projects like ConceptNet and Webchild have successfully compiled large amounts of knowledge on properties of objects in our...


Multi-Relational Record Linkage

Data cleaning and integration is typically the most expensive step in the KDD process. A key part, known as record linkage or de-duplication, is identifying which records in a database refer to the same entities. This problem is traditionally solved separately for each candidate record pair (followed by transitive closure). We propose to use instead a multi-relational approach, performing simul...



Publication date: 2006